Abstract: Maximum likelihood is a popular technique for isoform 
reconstruction. Here, we show that isoform reconstruction 
using short RNA-Seq reads by maximum likelihood is NP-hard. 



I. Introduction 

Isoform reconstruction is a key step in RNA-Seq analysis. 
Tools such as CEM [1], iReckon [2], NSMAP [3], and Montebello [4] 
use maximum likelihood for isoform reconstruction. 
The maximum likelihood approach has been observed to be 
computationally expensive. Here, we show that isoform 
reconstruction using short RNA-Seq reads by maximum likelihood 
is NP-hard. 

II. Results 

A Poisson mixture model [5]-[7] is used for isoform 
reconstruction. We represent a gene as a directed acyclic graph 
$G = (V, E)$ where each vertex in $G$ represents an exon, and 
a path in $G$ represents an isoform of this gene [8], [9]. In the 
model [7], the likelihood of observing $N_s$ at each read 3' end 
location equivalence class $s$ is 

$$\prod_{s \in S} \frac{e^{-\lambda_s} \lambda_s^{N_s}}{N_s!} \tag{1}$$

Here $S$ is the set of all read 3' end location equivalence classes, 
and 

$$\lambda_s = \sum_{i \in I_s} a_{is} \theta_i \tag{2}$$

where $I_s$ is the set of isoforms compatible with $s$, $\theta_i$ is isoform 
$i$'s expression level, and $a_{is} > 0$ is the sampling rate for read 
3' end location equivalence class $s$ on isoform $i$ [6]. 
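
As an illustration of the model, the following Python sketch evaluates the rate of each equivalence class (the sum of the sampling rate times the expression level over compatible isoforms) and the resulting Poisson log-likelihood for a toy gene. The dictionary layout, the function names, and the toy numbers are assumptions made for this example only; they are not taken from any of the cited tools.

```python
import math

def rates(a, theta):
    """Rate of each class s: sum of a[s][i] * theta[i] over compatible isoforms i."""
    return {s: sum(a_si * theta[i] for i, a_si in row.items())
            for s, row in a.items()}

def log_likelihood(counts, lam):
    """Log of the product over s of the Poisson probability of N_s at rate lam[s].

    Assumes every rate is strictly positive.
    """
    return sum(-lam[s] + n * math.log(lam[s]) - math.lgamma(n + 1)
               for s, n in counts.items())

# Hypothetical toy gene: two classes, two isoforms, unit sampling rates.
a = {"s1": {"iso1": 1.0}, "s2": {"iso1": 1.0, "iso2": 1.0}}
theta = {"iso1": 2.0, "iso2": 3.0}
counts = {"s1": 2, "s2": 5}

lam = rates(a, theta)
print(lam, log_likelihood(counts, lam))
```

In this toy case the rates equal the observed counts, so no other choice of rates gives a larger likelihood for these counts.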

To reconstruct isoforms, we seek to maximize the likelihood 
(1) over an isoform set $I$ and each isoform's 
expression level $\theta_i$, $i \in I$. In order to explain all observed 
reads, we must be able to align each read to at least one 
isoform in $I$. However, because there are a large number of 
possible isoforms, and it is generally believed that a gene 
only has a small number of highly expressed isoforms, we 
instead try to find $I$ and $\theta_i$, $i \in I$, that maximize the following 
penalized likelihood 

$$\left( \prod_{s \in S} \frac{e^{-\lambda_s} \lambda_s^{N_s}}{N_s!} \right) e^{-\kappa \|I\|_0} \tag{3}$$

where $\|I\|_0$ is the number of $\theta_i > 0$, $i \in I$, and $\kappa > 0$ 
is a real constant. Note that setting $\kappa = 0.5 \log(\sum_{s \in S} N_s)$ is 
equivalent to using the Bayesian information criterion [10] for 
an equivalent multinomial model [?], [11]. 
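
A minimal sketch of the penalty, assuming it is applied on the log scale: the objective is the log-likelihood minus $\kappa$ for every isoform with $\theta_i > 0$. The function names and the numbers are illustrative assumptions, not part of the cited models.

```python
import math

def bic_kappa(counts):
    """kappa = 0.5 * log(total read count), the BIC-style choice."""
    return 0.5 * math.log(sum(counts.values()))

def penalized_log_likelihood(log_lik, theta, kappa):
    """Log-likelihood minus kappa for every isoform with theta_i > 0."""
    n_expressed = sum(1 for t in theta.values() if t > 0)
    return log_lik - kappa * n_expressed

counts = {"s1": 40, "s2": 60}                      # hypothetical read counts
theta = {"iso1": 2.0, "iso2": 0.0, "iso3": 3.0}    # iso2 is not expressed
kappa = bic_kappa(counts)
print(kappa, penalized_log_likelihood(-10.0, theta, kappa))
```

Only isoforms with a strictly positive expression level are charged, so adding an isoform pays off only when it raises the log-likelihood by more than $\kappa$.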

To show the hardness of isoform reconstruction by maximizing 
the penalized likelihood, we consider the following 
decision problem. 

M-ISOFORM. 

INSTANCE: A set of reads aligned to a gene, where the 
read count at read 3' end location equivalence class $s$ 
is $N_s$. 

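QUESTION: Does there exist an isoform set $I$ with at most 
$M$ isoforms such that 

$$\prod_{s \in S} \frac{e^{-\lambda_s} \lambda_s^{N_s}}{N_s!} \ge \prod_{s \in S} \frac{e^{-N_s} N_s^{N_s}}{N_s!}?$$
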

Theorem 1. M-ISOFORM is NP-complete. 

III. Discussion 

We can avoid computationally determining a gene's isoform 
set if laboratory protocols can be used to find the gene's 
existing isoforms. It is possible to use methods such as 
paired-end tag sequencing [12] and single-molecule sequencing [13] 
to determine a gene's isoforms. We expect these and related 
technologies to mature in the foreseeable future, reducing the 
demand for computational resources. 

Haplotype inference [14], [15], a similar problem, is also 
known to be NP-hard [16]. We believe that haplotype inference 
will also benefit from technologies offering longer sequencing 
reads. 

IV. Proof 

The proof borrows ideas from [17], where a network flow 
approach is used for isoform reconstruction. To show that 
M-ISOFORM is NP-complete, we reduce 3-PARTITION [18], 
[19], a strongly NP-complete problem, to M-ISOFORM. We 
use an approach similar to the one used in [20], where flows 
are split into paths. 

Fig. 1: A gene structure for an instance of 3-PARTITION 

Proof of Theorem 1. 3-PARTITION is stated in [18] as 
follows. 

3-PARTITION. 

INSTANCE: Set $X$ of $3w$ elements, a bound $Y \in \mathbb{Z}^+$, and 
a size $u(x) \in \mathbb{Z}^+$ for each $x \in X$ such that $Y/4 < u(x) < Y/2$ 
and such that $\sum_{x \in X} u(x) = wY$. 

QUESTION: Can $X$ be partitioned into $w$ disjoint sets $X_1$, 
$X_2$, ..., $X_w$ such that, for $1 \le i \le w$, $\sum_{x \in X_i} u(x) = Y$ (note 
that each $X_i$ must therefore contain exactly three elements 
from $X$)? 
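
To make the problem statement concrete, here is a small brute-force sketch in Python. It runs in exponential time and is meant only for tiny instances; the function names and the example sizes are assumptions chosen for illustration.

```python
from itertools import combinations

def is_valid_instance(sizes, Y):
    """Check |X| = 3w, Y/4 < u(x) < Y/2 for every x, and sum of sizes = w * Y."""
    w, remainder = divmod(len(sizes), 3)
    return (remainder == 0
            and sum(sizes) == w * Y
            and all(Y < 4 * u and 2 * u < Y for u in sizes))

def three_partition(sizes, Y):
    """Return index triples that each sum to Y, or None if no partition exists."""
    def solve(remaining):
        if not remaining:
            return []
        first, rest = remaining[0], remaining[1:]
        # The first unused element must be grouped with two other unused elements.
        for a, b in combinations(rest, 2):
            if sizes[first] + sizes[a] + sizes[b] == Y:
                left = [k for k in rest if k != a and k != b]
                sub = solve(left)
                if sub is not None:
                    return [(first, a, b)] + sub
        return None
    return solve(list(range(len(sizes))))

print(three_partition([4, 5, 6, 4, 5, 6], 15))   # [(0, 1, 2), (3, 4, 5)]
print(three_partition([4, 4, 4, 6, 6, 6], 15))   # None
```

Both example instances satisfy the size bounds with $w = 2$ and $Y = 15$, but only the first can be split into triples that each sum to $Y$.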

For an instance of 3-PARTITION, we create a gene with the 
structure shown in Figure 1. Let $R \in \mathbb{Z}^+$ be a fixed 
constant. We make each exon $R$ bp long and each read $R+1$ 
bp long. At each bp in exon $A$, for $1 \le i \le 3w$ we have 
$u(x_i)$ reads starting at this location going to exon $B_i$, thus 
there are $\sum_{1 \le i \le 3w} u(x_i)$ reads starting at this location. For 
$1 \le i \le 3w$, at each bp in exon $B_i$, we have $u(x_i)$ reads 
starting at this location going to exon $C$. At each bp in exon 
$C$, for $1 \le i \le w$ we have $Y$ reads starting at this location 
going to exon $D_i$, thus there are $wY$ reads starting at this 
location. For $1 \le i \le w$, at each bp in exon $D_i$, we have $Y$ 
reads starting at this location going to exon $E$. We will show 
that this instance of 3-PARTITION has a solution if and only 
if there exists an isoform set $I$ with at most $3w$ isoforms such 
that 

$$\prod_{s \in S} \frac{e^{-\lambda_s} \lambda_s^{N_s}}{N_s!} \ge \prod_{s \in S} \frac{e^{-N_s} N_s^{N_s}}{N_s!}.$$

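It is easy to see that $\prod_{s \in S} \frac{e^{-\lambda_s} \lambda_s^{N_s}}{N_s!} \ge \prod_{s \in S} \frac{e^{-N_s} N_s^{N_s}}{N_s!}$ if 
and only if $\forall s \in S$, $\lambda_s = N_s$. 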
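
The step can be spelled out for a single equivalence class. Writing $N$ for its read count and $\lambda$ for its rate, and assuming $N \ge 1$ (which holds in the construction, since every $u(x_i)$ and $Y$ is a positive integer), the log of one Poisson factor and its derivatives are:

```latex
\begin{align*}
f(\lambda)   &= \log \frac{e^{-\lambda} \lambda^{N}}{N!}
              = -\lambda + N \log \lambda - \log N!, \\
f'(\lambda)  &= -1 + \frac{N}{\lambda}, \\
f''(\lambda) &= -\frac{N}{\lambda^{2}} < 0 .
\end{align*}
```

Hence $f$ is strictly concave on $\lambda > 0$ with its unique maximum at $\lambda = N$. Each factor of the product is therefore at most its value at rate $N_s$, and the product can reach the bound only when every class attains its maximum.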

If we have a solution to this instance of 3-PARTITION, it 
is easy to verify that an isoform set with $3w$ isoforms where 
$\theta_i = u(x_i)$, $1 \le i \le 3w$, and isoform $i$ consists of exons $A$, 
$B_i$, $C$, $D_j$ ($j$ satisfies $x_i \in X_j$), and $E$ is a solution to this 
particular instance of M-ISOFORM. 
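
This direction can be checked mechanically on a small example. The Python sketch below makes two simplifying assumptions that are not stated in the text: it keeps one equivalence class per exon junction instead of one per bp (every bp of an exon yields the same constraint), and it takes the sampling rate to be 1 for every compatible isoform. All names are hypothetical.

```python
from collections import Counter

def read_counts(sizes, Y):
    """Read count N_s per junction class for the gene built from (sizes, Y)."""
    w = len(sizes) // 3
    counts = {}
    for i, u in enumerate(sizes):
        counts[("A", f"B{i}")] = u      # reads from exon A into B_i
        counts[(f"B{i}", "C")] = u      # reads from B_i into C
    for j in range(w):
        counts[("C", f"D{j}")] = Y      # reads from C into D_j
        counts[(f"D{j}", "E")] = Y      # reads from D_j into E
    return counts

def isoforms_from_partition(sizes, groups):
    """Isoform A, B_i, C, D_j, E with theta_i = u(x_i) for each x_i in X_j."""
    return [(["A", f"B{i}", "C", f"D{j}", "E"], sizes[i])
            for j, group in enumerate(groups) for i in group]

def junction_rates(isoforms):
    """Rate per junction, assuming a sampling rate of 1 for compatible isoforms."""
    lam = Counter()
    for path, theta in isoforms:
        for junction in zip(path, path[1:]):
            lam[junction] += theta
    return dict(lam)

sizes, Y = [4, 5, 6, 4, 5, 6], 15
good = isoforms_from_partition(sizes, [[0, 1, 2], [3, 4, 5]])
bad = isoforms_from_partition(sizes, [[0, 1, 3], [2, 4, 5]])
print(junction_rates(good) == read_counts(sizes, Y))   # True
print(junction_rates(bad) == read_counts(sizes, Y))    # False
```

With a valid partition every rate matches its read count using exactly $3w = 6$ isoforms, while grouping the elements into triples that do not sum to $Y$ breaks the match at the junctions around the $D$ exons.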

If we have a solution to this particular instance of 
M-ISOFORM, we show that there also exists a solution to the 
instance of 3-PARTITION. In this case, the isoform set must 
have exactly $3w$ isoforms, because at least $3w$ isoforms are 
required to explain all the reads. Thus, for $1 \le i \le 3w$ we 
have $\theta_i = u(x_i)$. Because we also have $\forall s \in S$, $\lambda_s = N_s$, 
and $\forall x \in X$, $Y/4 < u(x) < Y/2$, we can see that for $1 \le j \le w$ 
exon $D_j$ has exactly three isoforms passing through it, and 
$\sum_{i \text{ passes through } D_j} \theta_i = Y$. Therefore, we have a solution to 
the instance of 3-PARTITION. 
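
Reading the partition off an isoform set amounts to grouping the isoforms by the $D$ exon they pass through. A minimal Python sketch follows, assuming each isoform is given as a five-exon path together with its expression level; this representation and the names are hypothetical.

```python
from collections import defaultdict

def partition_from_isoforms(isoforms, Y):
    """Group elements by the D exon of their isoform; return the groups or None."""
    groups = defaultdict(list)
    for path, theta in isoforms:
        # path = [A, B_i, C, D_j, E]: element x_i is placed in the set X_j.
        groups[path[3]].append((path[1], theta))
    ok = all(len(g) == 3 and sum(t for _, t in g) == Y for g in groups.values())
    return dict(groups) if ok else None

isoforms = [
    (["A", "B0", "C", "D0", "E"], 4),
    (["A", "B1", "C", "D0", "E"], 5),
    (["A", "B2", "C", "D0", "E"], 6),
    (["A", "B3", "C", "D1", "E"], 4),
    (["A", "B4", "C", "D1", "E"], 5),
    (["A", "B5", "C", "D1", "E"], 6),
]
print(partition_from_isoforms(isoforms, 15))
```

The function returns `None` when some $D$ exon does not carry exactly three isoforms whose expression levels sum to $Y$, which is the condition the argument above establishes.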

It is easy to see that M-ISOFORM is in NP if we use the 
real RAM model [21]. Because 3-PARTITION is strongly 
NP-complete [18], [19], we conclude that M-ISOFORM is 
NP-complete. 